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Abstract 

Recent experimental imaging techniques are able to tag and count molecular populations in a living cell. From these 
data mathematical models are inferred and calibrated. If small populations are present, discrete-state stochastic 
models are widely-used to describe the discreteness and randomness of molecular interactions. Based on time-series 
data of the molecular populations, the corresponding stochastic reaction rate constants can be estimated. This 
procedure is computationally very challenging, since the underlying stochastic process has to be solved for different 
parameters in order to obtain optimal estimates. Here, we focus on the maximum likelihood method and estimate 
rate constants, initial populations and parameters representing measurement errors. 



Introduction 

During the last decade stochastic models of networks 
of chemical reactions have become very popular. The 
reason is that the assumption that chemical concentra- 
tions change deterministically and continuously in time is 
not always appropriate for cellular processes. In particu- 
lar, if certain substances in the cell are present in small 
concentrations the resulting stochastic effects cannot be 
adequately described by deterministic models. In that 
case, discrete-state stochastic models are advantageous 
because they take into account the discrete random nature 
of chemical reactions. The theory of stochastic chemi- 
cal kinetics provides a rigorously justified framework for 
the description of chemical reactions where the effects 
of molecular noise are taken into account [1]. It is based 
on discrete-state Markov processes that explicitly repre- 
sent the reactions as state-transitions between population 
vectors. When the molecule numbers are large, the solu- 
tion of the deterministic description of a reaction network 
and the mean of the corresponding stochastic model agree 
up to a small approximation error. If, however, species 
with small populations are involved, then only a stochastic 
description can provide probabilities of events of inter- 
est such as probabilities of switching between different 
expression states in gene regulatory networks or the dis- 
tribution of gene expression products. Moreover, even the 
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mean behavior of the stochastic model can largely devi- 
ate from the behavior of the deterministic model [2]. In 
such cases the parameters of the stochastic model rather 
then the parameters of the deterministic model have to be 
estimated [3-5]. 

Here, we consider noisy time series measurements of 
the system state as they are available from wet-lab exper- 
iments. Recent experimental imaging techniques such 
as high-resolution fluorescence microscopy can measure 
small molecule counts with measurement errors of less 
than one molecule [6]. We assume that the structure of 
the underlying reaction network is known but the stochas- 
tic reaction rate constants of the network are unknown 
parameters. Then we identify rate constants that maxi- 
mize the likelihood of the time series data. Maximum like- 
lihood estimators are the most popular estimators since 
they have desirable mathematical properties. Specifically, 
they become minimum variance unbiased estimators and 
are asymptotically normal as the sample size increases. 

Our main contribution consists in devising an efficient 
algorithm for the numerical approximation of the likeli- 
hood and its derivatives w.r.t. the stochastic reaction rate 
constants. Furthermore, we show how similar techniques 
can be used to estimate the initial molecule numbers 
of a network as well as parameters related to the mea- 
surement error. We also present extensive experimental 
results that give insights about the identifiability of cer- 
tain parameters. In particular, we consider a simple gene 
expression model and the identifiability of reaction rate 
constants w.r.t. varying observation interval lengths and 
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varying numbers of time series. Moreover, for this system 
we investigate the identifiability of reaction rate constants 
if the state of the gene cannot be observed but only 
the number of mRNA molecules. For a more complex 
gene regulatory network, we present parameter estima- 
tion results where different combinations of proteins are 
observed. In this way we reason about the sensitivity of the 
estimation of certain parameters w.r.t. the protein types 
that are observed. 

Previous parameter estimation techniques for stochas- 
tic models are based on Monte-Carlo sampling [3,5] 
because the discrete state space of the underlying model 
is typically infinite in several dimensions and a priori 
a reasonable truncation of the state space is not avail- 
able. Other approaches are based on Bayesian inference 
which can be applied both to deterministic and stochastic 
models [7-9]. In particular, approximate Bayesian infer- 
ence can serve as a way to distinguish among a set 
of competing models [10]. Moreover, in the context of 
Bayesian inference linear noise approximations have been 
used to overcome the problem of large discrete state 
spaces [11]. 

Our method is not based on sampling but directly 
calculates the likelihood using a dynamic truncation of 
the state space. More precisely, we first show that the 
computation of the likelihood is equivalent to the evalu- 
ation of a product of vectors and matrices. This product 
includes the transition probability matrix of the associated 
continuous-time Markov process, i.e., the solution of the 
Kolmogorov differential equations (KDEs), which can be 
seen as a matrix-version of the chemical master equation 
(CME). Solving the KDEs is infeasible because of the state 
space of the underlying Markov model is very large or even 
infinite. Therefore we propose an iterative approximation 
algorithm during which the state space is truncated in an 
on-the-fly fashion, that is, during a certain time interval 
we consider only those states that significantly contribute 
to the likelihood. This technique is based on ideas pre- 
sented in [12], but here we additionally explain how the 
initial molecule numbers can be estimated and how an 
approximation of the standard deviation of the estimated 
parameters can be derived. Moreover, we provide more 
complex case studies and run extensive numerical exper- 
iments to assess the identifiability of certain parameters. 
In these experiments we assume that not all molecular 
populations can be observed and estimate parameters for 
different observation scenarios, i.e., we assume different 
numbers of observed cells and different observation inter- 
val lengths. We remark that this article is an extension of 
a previously published extended abstract [13]. 

The article is further organized as follows: After 
introducing the stochastic model in Section"Discrete- 
state stochastic model" we discuss the maximum like- 
lihood method in Section "Parameter inference" and 



present our approximation method in Section "Numerical 
approximation algorithm". Finally, we report on exper- 
imental results for two reaction networks in Section 
"Numerical results". 

Discrete-state stochastic model 

According to Gillespie's theory of stochastic chemical 
kinetics, a well-stirred mixture of n molecular species in a 
volume with fixed size and fixed temperature can be rep- 
resented as a continuous-time Markov chain [X(t), t > 0} 
[1]. The random vector X(f) = (Xi(t), . . . ,X n (t)) 
describes the chemical populations at time t, i.e., Xi(t) is 
the number of molecules of type i e {1, ...,«} at time 
t. Thus, the state space of X is %\ = {0, 1, . . .} n . The 
state changes of X are triggered by the occurrences of 
chemical reactions, which are of m different types. For 
j e {1, . . . , m] let vy e Z n be the nonzero change vector of 
the y-th reaction type. Thus, if X(t) = x and the y-th reac- 
tion is possible in x, then X(t + dt) = x + vy is the state of 
the system after the occurrence of the y-th reaction within 
the infinitesimal time interval [ t, t + dt). 

Each reaction type has an associated propensity func- 
tion, denoted by a\, . . . , a m , which is such that Qfy(x) • dt 
is the probability that, given X(t) = x, one instance of 
the y-th reaction occurs within [ t, t + dt). The value ay(x) 
is proportional to the number of distinct reactant com- 
binations in state x and to the reaction rate constant cj. 
The probability that a randomly selected pair of reactants 
collides and undergoes the y-th chemical reaction within 
[t,t + dt) is then given by cjdt. The value cj depends on 
the volume and the temperature of the system as well as 
on the microphysical properties of the reactant species. 

Example L We consider the simple gene expression 
model described in [4] that involves three chemical 
species, namely DNAcm, DNAoff> and mRNA, which 
are represented by the random variables X\(t), X2(t), 
and Xs(t), respectively The three possible reactions 
are DAL4on -> DNAoff, DNAoff -> DNAcn, and 
DNA ON -> DNA ON + mRNA. Thus, vi = (-1, 1, 0), v 2 = 
(1, —1,0), V3 = (0,0, 1). For a state x = (x\ 1 X2>x%), the 
propensity functions are a\ (x) = c\ -x\, Gf2(x) = <?2 '%2> and 
a 3 (x) = C3 -x\. Note that given the initial state x = (1, 0, 0), 
at any time, either the DNA is active or not, i.e. x\ = 0 and 
x 2 = 1, or x\ = 1 and X2 = 0. Moreover, the state space of 
the model is infinite in the third dimension. For a fixed time 
instant t > 0, no upper bound on the number of mRNA 
is known a priori. All states x with X3 e Z + have positive 
probability ift > 0 but these probabilities will tend to zero 
asx% -> 00. 

The CME 

For a state x e 7L\ and t > 0, let y?(x, t) denote the proba- 
bility Pr(X(£) = x), i.e., the probability that the process is 
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in state x at time t. Furthermore, let p(£) be the row vector 
with entries p(x, t) where we assume a fixed enumeration 
of all possible states. 

Given vi, . . . ,v m , a\, . . . ,a m , and some initial popula- 
tions x(0) = (*i(0), . . .,x n (0)) with P(X(0) = x(0)) = 1, 
the Markov chain X is uniquely specified and its evolution 
is given by the CME 

^p(f)=p(f)Q, (1) 

where Q is the infinitesimal generator matrix of X with 
Q(x, y) = a/(x) if y = x + \j and reaction type ; is possible 
in state x. Note that, in order to simplify our presentation, 
we assume here that all vectors v/ are distinct. All remain- 
ing entries of Q are zero except for the diagonal entries 
which are equal to the negative row sum. The ordinary 
first-order differential equation in (1) is a direct conse- 
quence of the Kolmogorov forward equation but standard 
numerical solution techniques for systems of first-order 
linear equations cannot be applied to solve (1) because 
the number of nonzero entries in Q typically exceeds the 
available memory capacity for systems of realistic size. If 
the expected populations of all species remain small (at 
most a few hundreds) then the CME can be efficiently 
approximated using projection methods [14-16] or fast 
uniformization methods [17,18]. The idea of these meth- 
ods is to avoid an exhaustive state space exploration and, 
depending on a certain time interval, restrict the analysis 
of the system to a subset of states. 

We are interested in the partial derivatives of p(£) w.r.t. a 
certain parameter X such as reaction rate constants Cj, j e 
{1, . . . , m] or initial populations Xi(0),i e {1, . . . , n). Later, 
they will be used to maximize the likelihood of observa- 
tions and to find optimal parameters. In order to explicitly 
indicate the dependence of p(£) on X we may write px(t) 
instead of p(£) and px (x, t) instead of p(x, t). We define the 
row vector sx(t) as the derivative of pi(t) w.r.t. X, i.e., 

dp k (t) Pa.+a(*) -PaXO 

sx(t) = — = lim A ^ 0 • 

dX A 

We denote the entry in (t) that corresponds to state x by 

sx (x, t). Note that we use bold face for vectors. By (1), we 

find that (t) is the solution of the system of ODEs 



d d 
—sx(t) = sx(t)Q + Px(t)—Q, 
at oA 



(2) 



when choosing X = cj for j e {1, . . . , m}. In this case, the 
initial condition is 5^(x, 0) = 0 for all x since p(x, 0) is 
independent of cj. If the unknown parameter is the i-th 
initial population, i.e., X = Xi(0), then we get 

jSx(t) = sx(t)Q, (3) 

with initial condition s^(0) = ^p^(O) since Q is inde- 
pendent of Xi(Q). Similar ODEs can be derived for higher 
order derivatives of the CME. 



Parameter inference 

Following the notation in [4], we assume that observa- 
tions of the reaction network are made at time instances 
t\,...,tR e R>o where t\ < • • • < Since it is unre- 
alistic to assume that all species can be observed, we 
assume w.Lo.g. that the species are ordered such that we 
have observations of X\, . . .,Xd for some fixed d with 
1 < d < n, i.e. 0{(ti) is the observed number of species 
i at time tg for i e {1, . . . , d} and i e {1, . . . ,R}. Let 
0(ti) = (Oi(^), . . . , Od{ti)) be the corresponding vec- 
tor of observations. Since these observations are typically 
subject to measurement errors, we assume that 0/(^) = 
Xi(ti)+€i(ti) where the error terms €i(ty) are independent 
and identically normally distributed with mean zero and 
standard deviation o. Note that Xi(ti) is the true popula- 
tion of the i-th species at time tg. Clearly, this implies that, 
conditional on Xi{tg), the random variable 0/(^) is inde- 
pendent of all other observations as well as independent 
of the history of X before time tg. 

We assume further that we do not know the values of 
the rate constants c = (<?i,...,c m ) and our aim is to 
estimate these constants. Similarly, the initial populations 
x(0) and the exact standard deviation a of the error terms 
are unknown and must be estimated. We remark that 
it is straightforward to extend the estimation framework 
such that a covariance matrix for a multivariate normal 
distribution of the error terms is estimated. In this way, 
different measurement errors of the species can be taken 
into account as well as dependencies between error terms. 

Let / denote the joint density of 0(£i), . . . , Ofe) and, 
by convenient abuse of notation, for a vector xg = 
(xi,...,Xd) let X(tg) = xi represent the event that 
Xi(tg) = Xi for 1 < / < d. In other words, X(^) = xg 
means that the populations of the observed species at time 
tg equal the populations of vector xg. Note that this event 
corresponds to a set of states of the Markov process since 
d may be smaller than n. More precisely, Pr (X(^) = x^) = 
J2y:yi=xi,i<dP(Y> Now the likelihood of the observation 
sequence 0(£i), . . . , 0(£#) is given by 

£=/(0(fi),...,0(fe)) 

X(fi) =xi,...,X(fij) = x R ) 
Pr(X(fi) = xi,...,X(to)=xu). 



Note that C depends on the chosen rate parameters c and 
the initial populations x(0) since the probability measure 
Pr(-) does. Furthermore, C depends on a since the density 
/ does. When necessary, we will make this dependence 
explicit by writing £(x(0), c, a) instead of £. We now 
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seek constants c*, initial populations x(0) and a standard 
deviation a* such that 



£(x(0)*,c*,<r*) = max £(x(0), c, a) 

x(0),<r,c 



(5) 



where the maximum is taken over all a > 0 and vectors 
x(0), c with all components strictly positive. This opti- 
mization problem is known as the maximum likelihood 
problem [19]. Note that x(0)*, c* and a* are random vari- 
ables because they depend on the (random) observations 
0(fi),...,0(fc). 

If more than one sequence of observations is made, 
then the corresponding likelihood is the product of 
the likelihoods of all individual sequences. More pre- 
cisely, if O k (ti) is the /c-th observation that has been 
observed at time instant t\ where k e {1, . . .,/<T}, then 
we define /^(xiT)), c, a) as the probability to observe 
O k (ti),..., O k (t R ) and maximize 



K 



f]A(x(0),C,or). 

k=l 



(6) 



In what follows, we concentrate on expressions for 
£/:(x(0), c, a) and ^£k( x (Q)> c, a). We first assume K = 1 
and drop index /<. We consider the case K > 1 later. 
In (4) we sum over all population vectors xi, . . . ,x R of 
dimension d such that Pr(X(^) = xi, 1 < i < R) > 0. 
Since X has a large or even infinite state space, it is com- 
putationally infeasible to explore all possible sequences. 
In Section "Numerical approximation algorithm" we pro- 
pose an algorithm to approximate the likelihoods and 
their derivatives by dynamically truncating the state space 
and using the fact that (4) can be written as a product of 
vectors and matrices. Let 0 a be the density of the nor- 
mal distribution with mean zero and standard deviation 
a. Then 

/ (O(fi), . . . , 0(t R ) | X(fi) = xi, . . . , X(t R ) = x R ) 

R d 

= \\Y\f{O i (t l )\X i (t l )=x il ) 

1=1 1=1 

R d 



1=1 i=l 



where xi = (xu, . . . ,Xdt). If we write w(xi) for 
Y\f=i <Pa(Qi(ti) - Xii), then the sequence xi, . . . ,x R has 
"weight" nf=i w ( x i) an d> thus, 



£ = I] • • • E Pr ( x ^i) = x *< • • • > = x «> fl w(x ^- 



XI x R 



(7) 



Moreover, for the probability of the sequence xi, . 
we have 



Vr(X(t 1 )=x 1 ,...,X(t R )=x R )=p(x 1 ,t 1 )P 2 (x 1 ,x 2 )... 
Pr(x R -i,x r ) 

where Pi(x,y) = Pr(Xfe) = y | X(fc_i) = x) for d- 
dimensional population vectors x and y. Hence, (7) can be 
written as 



£ = ^2p(*i> h)w(xi) 5^i > 2(xi,x 2 )w(x 2 ) . . . 



xi 



X2 



^2P R (x R -i,x R )w(x R ). 



(8) 



x# 



Assume that d = n and let Pi be the matrix with entries 
Pi (x, y) for all possible states x, y. Note that Pi is the tran- 
sition probability matrix of X for time step ti — ti-\ and 
thus the general solution gQ(^-^-i) of the Kolmogorov 
forward and backward differential equations 



-Pi = QP £ , 



dt 



Pi=PiQ. 



In this case, using p(£i) = p(to)Pi with to = 0, we can 
write (8) in matrix-vector form as 

C = p(t 0 )Pi W X P 2 W 2 ...Pr W R e. (9) 

Here, e is the vector with all entries equal to one and Wi 
is a diagonal matrix whose diagonal entries are all equal to 
w(xi) with i e {1, ... ,R}> where Wi is of the same size as 
Pi. 

If d < n, then we still have the same matrix-vector 
product as in (9), but define the weight w(x) of an n- 
dimensional population vector as 

d 

w(xi,...,x n ) = Y\^a(Oi(ti) - Xi), 

i=l 

i.e. the populations of the unobserved species have no 
influence on the weight. 

Since it is in general not possible to analytically obtain 
parameters that maximize C, we use numerical optimiza- 
tion techniques to find c*, x(0)* and a*. Typically, such 
techniques iterate over values of c, x(0) and a and increase 
the likelihood £(c, a) by following the gradient. There- 
fore, we need to calculate the derivatives j^C, ^j^C and 

T-C. For J-Cwe obtain 



9^(0) 



da 



d 

dCj 



3 c, 



c = 



d 

dCj 



(p(t 0 )P 1 W 1 P 2 W 2 ...P R W R e) 



(10) 



e. 



The derivative of C w.r.t. Xi(0) and a is derived analo- 
gously. The only difference is that p(£n) is dependent on 
Xi(0) and P\, . . . , P R are independent of a but W\, . . . , W R 
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depend on a. It is also important to note that expressions 
for partial derivatives of second order can be derived in 
a similar way. These derivatives can then be used for an 
efficient gradient-based local optimization. 

For K > 1 observation sequences we can maximize the 
log-likelihood 

K K 

log]~[A = I>gA, (11) 

k=l k=l 

instead of the likelihood in (6). Note that the derivatives 
are then given by 

where X is c/, Xi(0) or a. It is also important to note 
that only the weights w(xi) depend on /<, that is, on 
the observed sequence O k (h), • • • , O k (t R ). Thus, when we 
compute Ck based on (9) we use for all k the same tran- 
sition matrices P\, . . . ,Pr and the same initial conditions 
p(£o)> but possibly different matrices W\, . . . , Wr. 

Numerical approximation algorithm 

In this section, we focus on the numerical approxima- 
tion of the likelihood and the corresponding derivatives. 
Our algorithm calculates an approximation of the likeli- 
hood based on (9) by traversing the matrix-vector product 
from the left to the right. The main idea behind the algo- 
rithm is that instead of explicitly computing the matrices 
Pi, we express the vector-matrix product u(ti-\)Pi as 
a system of ODEs similar to the CME (cf. Equation (1)). 
Note that even though Pi is sparse the number of states 
may be very large or infinite, in which case we cannot 
compute Pi explicitly. Let u(£n)> • • • > u(£#) be row vectors 
that are obtained during the iteration over time points 
to,... t tR, that is, we define C recursively as C = u(£#)e 
with u(£n) = p(^o) and 

u(ti) = u(ti-!)Pi Wi for all 1 < i < R, 

where to = 0. We solve R systems of ODEs 

j t m = mQ d3) 

with initial condition u(^-i) = u(^_i) for the time inter- 
val [ ty-i, ti) where i <E {1, ... ,R}. After solving the €-th 
system of ODEs we set u(ti) = u(ti)Wi and finally com- 
pute C = u(£#)e. We remark that this is the same as 
solving the CME for different initial conditions and due 
to the largeness problem of the state space we use the 
dynamic truncation of the state space that we proposed in 
previous work [17]. The idea is to consider only the most 
relevant equations of the system (13), i.e., the equations 
that correspond to those states x where the relative contri- 
bution u(x, t)/(ii(ti)e) is greater than a threshold 8. Since 



during the integration the contribution of a state might 
increase or decrease we add/remove equations on-the-fly 
depending on the current contribution of the correspond- 
ing state. Note that the structure of the CME allows us to 
determine in a simple way which states will become rele- 
vant in the next integration step. For a small time step of 
length h we know that the probability being moved from 
state x — \j to x is approximately aj(x — \j)h. Thus, we can 
simply check whether a state that receives a certain proba- 
bility inflow receives more than the threshold. In this case 
we consider the corresponding equation in (13). Other- 
wise, if a state does not receive enough probability inflow, 
we do not consider it in (13). For more details on this 
technique we refer to [17]. 

Since the vectors u(ti) do not sum up to one, we scale 
all entries by multiplication with l/(u(^)e). This simpli- 
fies the truncation of the state space using the significance 
threshold 8 since after scaling it can be interpreted as 
a probability. In order to obtain the correct (unsealed) 
likelihood, we compute C as C = Y\i=i ufe)e. For our 
numerical implementation we used a threshold of 8 = 
10 -15 and handle the derivatives of C in a similar way. To 
shorten our presentation, we only consider the derivative 
j^C in the sequel of the article. Iterative schemes for J^C 

and 3^ 0 ) £> are derived analogously. From (10) we obtain 

j^C = uj(tR)e with u/(£o) = 0 and 

d 

vijiti) = (uj(fi-i)Pi+u(ti-i)—Pi)Wi for all 1 < i < R, 

' ' dej 

where 0 is the vector with all entries zero. Thus, during the 
solution of the €-th ODE in (13) we simultaneously solve 

^Uj(t)=Uj(t)Q + m^-Q (14) 
at dej 

with initial condition uj(ti-i) = u/(^_i) for the time 
interval [ ty-i, ti). As above, we set uj(ti) = uj(ti)Wi and 
obtain j^C as u/(£#)e. 

Solving (13) and (14) simultaneously is equivalent to the 
computation of the partial derivatives in (2) with differ- 
ent initial conditions. Numerical experiments show that 
the approximation errors of the likelihood and its deriva- 
tives are of the same order of magnitude as those of the 
transient probabilities and their derivatives. For instance, 
for a finite-state enzymatic reaction system that is small 
enough to be solved without truncation we found that the 
maximum absolute error in the approximations of the vec- 
tors p(£) and sx(t) is 10 -8 if the truncation threshold is 
8 = 10 -15 (details not shown). 

In the case of K observation sequences we repeat the 
above algorithm in order to sequentially compute for 
k e {1, . . .,/<C}. We exploit (11) and (12) to compute the 
total log-likelihood and its derivatives as a sum of indi- 
vidual terms. In a similar way, second derivatives can be 
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approximated. Obviously, it is possible to parallelize the 
algorithm by computing Ck in parallel for all k. 

In order to find values for which the likelihood becomes 
maximal, global optimization techniques can be applied. 
Those techniques usually use a heuristic for different ini- 
tial values of the parameters and then follow the gradient 
to find local optima of the likelihood. In this step the algo- 
rithm proposed above is used since it approximates the 
gradient of the likelihood. The approximated global opti- 
mum is then chosen as the minimum/maximum of the 
local optima, i.e, we determine those values of the param- 
eters that give the largest likelihood. Clearly, this is an 
approximation and we cannot guarantee that the global 
optimum was found. Note that this would also be the case 
if we could compute the exact likelihood. If, however, a 
good heuristic for the starting points is chosen and the 
number of starting points is large, then it is likely that 
the approximation is accurate. Moreover, since we have 
approximated the second derivative of the log-likelihood, 
we can compute the entries of the Fisher information 
matrix and use this to approximate the standard deviation 
of the estimated parameters, i.e., we consider the square 
root of the diagonal entries of the inverse of a matrix H 
which is the Hessian matrix of the negative log-likelihood. 
Assuming that the second derivative of the log-likelihood 
is computed exactly, these entries asymptotically tend to 
the standard deviations of the estimated parameters. 

We remark that the approximation proposed above 
becomes unfeasible if the reaction network contains 
species with high molecule numbers since in this case the 
number of states that have to be considered is very large. A 
numerical approximation of the likelihood is, as the solu- 
tion of the CME, only possible if the expected populations 
of all species remain small (at most a few hundreds) and 
if the dimension of the process is not too large. More- 
over, if many parameters have to be estimated, the search 
space of the optimization problem may become unfeasi- 
bly large. It is however straightforward to parallelize local 
optimizations starting from different initial point. 

Numerical results 

In this section we present numerical results of our param- 
eter estimation algorithm applied to two models, the sim- 
ple gene expression in Example 1 and a multi-attractor 
model. The corresponding SBML files are provided as 
Additional files 1 and 2. For both models, we generated 
time series data using Monte-Carlo simulation where we 
added white noise to represent measurement errors, i.e. 
we added random terms to the populations that follow a 
normal distribution with mean zero and a standard devia- 
tion of a. Our algorithm for the approximation of the like- 
lihood is implemented in C++ and linked to MATLABs 
optimization toolbox [20] which we use to minimize the 
negative log-likelihood. The global optimization method 



(Matlabs GlobalSearch [21]) uses a scatter-search algo- 
rithm to generate a set of trial points (potential starting 
points) and heuristically decides when to perform a local 
optimization. We ran our experiments on an Intel Core U 
at 2.8 GHz with 8 GB main memory. 

Simple gene expression 

For our first model, the simple gene expression as intro- 
duced in Example 1, we chose the same parameters 
as Reinker et al.[4] multiplied by a factor of 10, i.e., 
c = (0.270, 1.667, 4.0) and as the initial condition we 
have ten mRNA molecules and the DNA is inactive. We 
generated K observation sequences of length T = 100.0 
and observed all species at R equidistant observation time 
points. We added white noise with standard deviation 
o — 1.0 to the observed mRNA molecule numbers at each 
observation time point. For the case K = 5,R = 100 we 
plot the generated observation sequences in Figure 1. We 
estimated the reaction rate constants, the initial molecule 
numbers, and the parameter a of the measurement errors 
for the case K = 5,R = 100 where we chose the inter- 
val [ 10 -5 , 10 3 ] as a constraint for the rate constants, 
the interval [ 0, 100] for the initial number of mRNA 
molecules and [ 0, 5] for a. Since we use a global optimiza- 
tion method, the running time of our method depends 
on the number of trial points generated by GlobalSearch. 
In Figure 2 we plot the trial points (red points) and local 
optimization runs (differently colored lines) for the case 
of 10 (a), 100 (b) and 1000 (c) trial points. The intersec- 
tion of the dashed blue lines represents the location of 
the original parameters. In the case of ten trial points, the 
running time was about one minute and the local opti- 
mization was performed only once. In the case of 100 and 
1000 trial points, the running times were about 22min 
and 1.9 h, respectively and several local optimization runs 
converged in nearly the same point. However, we remark 
that in general the landscape of the target function might 
have multiple local minima and require more trial points 
resulting in longer running times. 

We ran experiments for varying values of K and R 
(K, R e {1, 2, 5, 10, 20, 50, 100}) to get insights whether 
for this network it is more advantageous to have many 
observation sequences with long observation intervals or 
few observation sequences with a short time between 
two successive observations. In addition, we ran the same 
experiments with the restriction that only the number of 
mRNA molecules was observable but not the state of the 
gene. In both cases we approximated the standard devia- 
tions of our estimators as a measure of quality by repeating 
our estimation procedure 100 times and by the Fisher 
information matrix as explained at the end of the previous 
section. We used 100 trial points for the global optimiza- 
tion procedure and chose tighter constraints than above 
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for the rate constants ([0.01,1] for c\ and [0.1,10] for 
C2, C3) to have a convenient total running time. 

The results are depicted in Figure 3 for the fully observ- 
able system and in Figure 4 for the restricted system, 
where the state of the gene was not visible. In these figures 
we present the estimations of the parameters c\, C2, £3, or, 
and an estimation of the initial condition, i.e. the number 
of mRNA molecules at time point t = 0. Moreover, we 
give the total running time of the procedure (Figures 3f 
and 4f ). Our results are plotted as a gray landscape for all 
combinations of K and R. The estimates are bounded by a 
red grid enclosing an environment of one standard devia- 
tion around the respective average over all 100 estimates 
that we approximated. The real value of the parameter is 
indicated by a dotted blue rectangle. 

At first, we remark that neither the quality of the estima- 
tion nor the running time of our algorithm is significantly 
dependent on whether we observe the state of the gene 
in addition to the mRNA level or not. Moreover, con- 
cerning the estimation of all of the parameters, one can 
witness that the estimates converge more quickly against 
the real values along the K axis than the R axis and also the 
standard deviations decrease faster. Consequently, at least 
for the gene expression model, it is more advantageous 
to increase the number of observation sequences, than 
the number of measurements per sequence. For exam- 
ple, K = 100 sequences with only one observation each 
already provide enough information to estimate c\ up to 



a relative error of around 2.1%. Unfortunately, in this case 
the computation time is the highest since we have to com- 
pute K individual likelihoods (one for each observation 
sequence). Moreover, if R is small then the truncation 
of the state space is less efficient. The reason is that we 
have to integrate for a long time until we multiply with 
the weight matrix Wi. After this multiplication we decide 
which states contribute significantly to the likelihood and 
which states are neglected. We can, however, trade off 
accuracy against running time by varying K. 

For the measurement noise parameter a we see that it 
is more advantageous to increase R. Even five observa- 
tion sequences with a high number of observations per 
sequence (R = 100) suffice to estimate the noise up to a 
relative error of around 10.2%. For the estimation of the 
initial conditions, both K and R seem to play an equally 
important role. 

The standard deviations of the estimators give infor- 
mation about the accuracy of the estimation. In order 
to approximate the standard deviation we used statistics 
over 100 repeated experiments. In a realistic setting one 
would rather use the Fisher information matrix to approx- 
imate the standard deviation of the estimators since it 
is in most cases difficult to observe 100 • K observation 
sequences of a real system. Therefore we compare the 
results of one experiment with K observation sequences 
and standard deviations approximated using the Fisher 
information matrix to the case where the experiment is 
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Figure 2 Start points and gradient convergence of the optimization procedure for the gene expression example: Red pluses show the 
potential start points. We use 1 0, 1 00, and 1 000 start points in case (a), (b), and (c), respectively. The markers that are connected by lines show the 
iterative steps of the gradient convergence while the dashed blue line shows the true values of the parameters. We chose K = 5, R = 1 00 and 
assume that the parameters are in the range [ 1 0 -5 , 1 0 3 ]. 
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(©) c 3 (f) time (hours) 




Figure 3 Results of the gene expression case study with observable gene state. The dotted blue rectangle gives the true value of ci , c 2 , c 3 , o 

(obs. error), and mRNA(O). The red grid corresponds to the approximated standard deviation of the estimators. 



repeated 100 times. The results for varying values of K and 
R are given in Table 1 We observe that the approximation 
using the Fisher information matrix is in most cases close 
to the approximation based on 100 repetitions as long as K 
and R are not too small. This comes from the fact that the 
Fisher information matrix converges to the true standard 
deviation as the sample size increases. 

Multi-attractor model 

Our final example is a part of the multi-attractor model 
considered by Zhou et al. [22]. It consists of the three 
genes MafA, Pax4, and <5-gene, which interact with each 
other as illustrated in Figure 5. The corresponding pro- 
teins bind to specific promoter regions on the DNA and 
(de-) activate the genes. The reaction network has 2 3 dif- 
ferent gene states, also called modes, since each gene can 
be on or off. It is infinite in three dimensions since for 
the proteins there is no fixed upper bound. The edges 



between the nodes in Figure 5 show whether the protein of 
a specific gene can bind to the promoter region of another 
gene. Moreover, edges with normal arrow heads corre- 
spond to binding without inhibition while the edges with 
line heads show inhibition. 

We list all 24 reactions in Table 2 For simplicity we 
first assume that there is a common rate constant for all 
protein production reactions (p), for all protein degrada- 
tions (d), binding (b), and unbinding (u) reactions. We 
further assume that initially all genes are active and no 
proteins are present. For the rate constants we chose c = 
(p,d,b,u) = (5.0,0.1,1.0,1.0) and generated K e {1,5} 
sample paths of length T = 10.0. We added normally 
distributed noise with zero mean and standard deviation 
a = 1.0 to the protein levels at each of the R = 100 
observation time points. Plots of the generated observa- 
tion sequences are presented in Figure 1 b-d for the case 
K = 5. For the global optimization we used ten trial 
points. We chose the interval [0.1, 10] as a constraint for 
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Figure 4 Results of the gene expression case study (as in Figure 3) but the state of the gene is not observed. 



the rate constants p, b, u and the interval [ 0.01, 1] for d. 
We estimated the parameters for all 2 3 — 1 = 7 possi- 
bilities of observing or not observing the three protein 
numbers where at least one of them had to be observ- 
able. In addition we repeated the parameter estimation 
for the fully observable system where in addition to the 
three proteins also the state of the genes was observed. 



The results are depicted in Figure 6 where the #-axis of 
the plots refers to the observed proteins. For instance, the 
third entry on the #-axis of the plot in Figure 6 a shows 
the result of the estimation of parameter c\ = 5 based on 
observation sequences where only the molecule numbers 
of the proteins MafAProt and DeltaProt were observed. 
For this case study, we used the Fisher information matrix 



Table 1 Different approximations of the standard deviations of the estimators 



Method 


K 


R 


Cl 


C 2 


cs 


a 


mRNA(O) 


Fisher inf. matrix 


10 


10 


0.0545104 


0.561963 


0.935324 


0.364339 


0.639471 


100 experiments 






0.0358142 


0.198700 


0.262223 


0.392884 


0.490305 


Fisher inf. matrix 


20 


20 


0.0324508 


0.299487 


0.451476 


0.174095 


0.594820 


100 experiments 






0.0304157 


0.167431 


0.287471 


0.134506 


0.436059 


Fisher inf. matrix 


50 


50 


0.0139185 


0.110709 


0.152229 


0.0440282 


0.238033 


100 experiments 






0.0140331 


0.078516 


0.146232 


0.0353837 


0.183888 


Fisher inf. matrix 


100 


100 


0.00866066 


0.0548249 


0.0728129 


0.0182564 


0.208469 


100 experiments 






0.00691956 


0.0430123 


0.0641821 


0.0217544 


0.187968 
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Figure 5 Illustration of the multi-attractor model. 



to approximate the standard deviations of our estimators, 
plotted as bars in Figure 6 with the estimated parameter 
as midpoint. The fully observable case is labelled by "full". 

We observe in Figure 6 that as expected the accuracy 
of the estimation and the running time of our algorithm 
is best when we have full observability of the system and 
gets worse with an increasing number of unobservable 



Table 2 Chemical reactions of the multi-attractor model 



PaxDna 


p > 


PaxDna + PaxProt 


PaxProt 


d 


0 


PaxDna + DeltaProt 


b 


PaxDnaDeltaProt 


PaxDnaDeltaProt 


u ^ 


PaxDna + DeltaProt 


MafADna 


P 


MafADna + MafAProt 


MafAProt 


d ^ 


0 


MafADna + PaxProt 


b 


MafADnaPaxProt 


MafADnaPaxProt 


u 


MafADna + PaxProt 


MafADnaPaxProt 


P 


MafADnaPaxProt + MafAProt 


MafADna + MafAProt 


b 


MafADnaMafAProt 


MafADnaMafAProt 


u 


MafADna + MafAProt 


MafADnaMafAProt 


P 


MafADnaMafAProt + MafAProt 


MafADna + DeltaProt 


b ^ 


MafADnaDeltaProt 


MafADnaDeltaProt 


u 


MafADna + DeltaProt 


DeltaDna 


P 


DeltaDna + DeltaProt 


DeltaProt 


d ^ 


0 


DeltaDna + PaxProt 


b 


DeltaDnaPaxProt 


DeltaDnaPaxProt 


u 


DeltaDna + PaxProt 


DeltaDnaPaxProt 


P > 


DeltaDnaPaxProt + DeltaProt 


DeltaDna + MafAProt 


b 


DeltaDnaMafAProt 


DeltaDnaMafAProt 


u 


DeltaDna + MafAProt 


DeltaDna + DeltaProt 


b ^ 


DeltaDnaDeltaProt 


DeltaDnaDeltaProt 


u 


DeltaDna + DeltaProt 


DeltaDnaDeltaProt 


P 


DeltaDnaDeltaProt + DeltaProt 



species. Still the estimation quality is very high when 
five observation sequences are provided for almost all 
combinations and parameters. When only one observa- 
tion sequence is given (K = 1), the parameter estima- 
tion becomes unreliable and time consuming. This comes 
from the fact that the quality of the approximation highly 
depends on the generated observation sequence. It is pos- 
sible to get much better and faster approximations with 
a single observation sequence. However, we did not opti- 
mize our results but generated one random observation 
sequence and ran our estimation procedure once based 
on this. 

Recall that we chose common parameters p, d, b, u for 
production, degradation, and (un-)binding for all three 
protein species. Next we "decouple" the binding rates and 
estimate the binding rate of each protein independently. 
We illustrate our results in Figure 7. Again, in case of a 
single observation sequence (K = 1) the estimation is 
unreliable in most cases. If the true value of the param- 
eter is unknown, then the high standard deviation shows 
that more information (more observation sequences) is 
necessary to estimate the parameter. In order to esti- 
mate the binding rate of PaxProt, we see that observing 
MafAProt yields the best result while for the binding rate 
of MafAProt observing PaxProt is best. Only for the bind- 
ing rate of DeltaProt, the best results are obtained when 
the corresponding protein (DeltaProt) is observed. The 
running times of the estimation procedure are between 
10 and 80 h, usually increase with K and depend on the 
observation sequences. 

In Table 3 we list the results of estimating the produc- 
tion rate 5.0 in the multi-attractor model where we chose 
R = 100. More precisely, we estimated the production rate 
of each protein independently when the other two pro- 
teins were observed. Since the population of the PaxProt is 
significantly smaller than the populations of the other two 
proteins, its production rate is more difficult to estimate. 
The production rate of MafAProt is accurately estimated 
even if only a single observation sequence is consid- 
ered. For estimating the production rate of DeltaProt, 
K = 5 observation sequences are necessary to get an 
accurate result. 

Finally, we remark that for the multi-attractor model it 
seems difficult to predict whether for a given parameter 
the observation of a certain set of proteins yields a good 
accuracy or not. It can, however, be hypothesized that, if 
we want to accurately estimate the rate constant of a cer- 
tain chemical reaction, then we should observe as many 
of the involved species as possible. Moreover, it is rea- 
sonable that constants of reactions that occur less often 
are more difficult to estimate (such as the production of 
PaxProt). In such a case more observation sequences are 
necessary to provide reliable information about the speed 
of the reaction. 
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Figure 6 Parameter estimation results for the multi-attractor model. The x-axis shows the species that were observed during the estimation 
procedure. The dotted blue line corresponds to the true value of C], c 2 , cs, and c 4 , respectively. The error bars in (a)-(d) show the mean (plus/minus 
the standard deviation) of the estimators. In (e) we plot the running time of the estimation procedure. 



Conclusion 

Parameter inference for stochastic models of cellular 
processes demands huge computational resources. We 
proposed an efficient numerical method to approximate 
maximum likelihood estimators for a given set of obser- 
vations. We consider the case where the observations 



are subject to measurement errors and where only the 
molecule numbers of some of the chemical species are 
observed at certain points in time. In our experiments we 
show that if the observations provide sufficient informa- 
tion then parameters can be accurately identified. If only 
little information is available then the approximations of 
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Figure 7 Results of the multi-attractor (as in Figure 6), but we estimate the binding rate of each protein independently. 



the standard deviations of the estimators indicate whether As future work we plan a comparison of our technique 
more observations are necessary to accurately calibrate to parameter estimation based on Bayesian inference, 
certain parameters. In addition, we will examine whether a combination of 
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Table 3 Production rate estimation in the multi-attractor 
model 



Protein 


K 


Estimated rate 
constant 


Standard 
deviation 


Time 
(hours) 


Observed 
proteins 


PaxProt 


1 


10.0 


13.6159 


7.45 


MafAProt, 
DeltaProt 




5 


0.5693 


2.1842 


6.34 




MafAProt 


1 


4.9998 


4.9884 


11.62 


PaxProt, 
DeltaProt 




5 


5.4853 


2.3873 


13.86 




DeltaProt 


1 


2.5453 


1.8075 


4.35 


PaxProt, 
MafAProt 




5 


5.3646 


1 .4682 


12.39 





methods based on prior knowledge and the maximum 
likelihood method is useful Future plans further include 
parameter estimation methods for systems where some 
chemical species have small molecule numbers while oth- 
ers are high rendering a purely discrete representation 
infeasible. In such cases, hybrid models are advantageous 
where large populations are represented by continuous 
deterministic variables while small populations are still 
described by discrete random variables [23]. 



Additional files 



Additional file 1 : SBML file of the gene expression example. 

• File name: genexpression.xml 

• File format: SBML (see http://www.sbml.org/sbml/level2/version4) 

• File extension: xml 

Additional file 2: SBML file of the multiattractor model. 

• File name: multiattractor.xml 

• File format: SBML (see http://www.sbml.org/sbml/level2/version4) 

• File extension: xml 
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